Exploring Places

Experiments:

    1. Visualizing the Places dataset
    2. Exploring Place Tags
    3. Exploring Town & Place Names
    4. Exploring Properties
    5. Exploring Place Description Similarities
    6. Topic Modelling of Place Descriptions
In [1]:
import json
import pandas as pd
import plotly.express as px
import os
import plotly.graph_objects as go
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from bertopic import BERTopic
In [2]:
#data_path="places.json"
data_path="dataset/sample_20180501.json"
# load the JSON file once and keep the list of places
with open(data_path, 'r') as f:
    data = json.load(f)
places=data["places"]
print(len(places))
df = pd.DataFrame(places)
1224

2. Visualizing the places dataframe

In [3]:
df["properties"].iloc[0]
Out[3]:
{'place.child-restrictions': True,
 'place.facilities.free-wifi': True,
 'place.facilities.dogs-allowed': False,
 'place.facilities.parking': True,
 'place.facilities.toilets': True,
 'place.facilities.toilets_disabled': False,
 'place.facilities.wheelchair-access': False,
 'place.capacity.max': '160'}
In [4]:
df.shape[0]
Out[4]:
1224

Experiment 1: Exploring Place Ids

In [5]:
df_ids=df.groupby(['place_id']).size().reset_index()
df_ids=df_ids.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
df_ids
Out[5]:
place_id number_of_times
0 1 1
813 60849 1
820 61545 1
819 61451 1
818 61204 1
... ... ...
407 22375 1
406 22225 1
405 22215 1
404 22207 1
1223 114042 1

1224 rows × 2 columns
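
Every place_id in the table above occurs exactly once, so the same check can be written more directly. The following is a minimal sketch on a toy frame; the `toy` values are made up for illustration, and in the notebook `df` would take its place.

```python
import pandas as pd

# Toy stand-in for the notebook's df; the ids are illustrative only.
toy = pd.DataFrame({"place_id": [1, 371, 372, 114042]})

# True when no place_id is repeated.
ids_unique = toy["place_id"].is_unique

# Rows whose place_id occurs more than once (empty when all ids are unique).
duplicated_rows = toy[toy["place_id"].duplicated(keep=False)]
print(ids_unique, len(duplicated_rows))
```

If `duplicated_rows` were not empty, it would show directly which places need a closer look.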

Experiment 2: Exploring Tags Places

We are going to separate the elements stored in each tag list into new rows, so that every row holds exactly one tag.

In [6]:
df["tags"][0:5]
Out[6]:
0        [Bar & pub food, Comedy, Restaurants, Venues]
1    [Cinemas, Community centre, Public buildings, ...
2    [Arts Centre, Galleries, Language School, Publ...
3                         [Conference Centres, Venues]
4                                   [Theatres, Venues]
Name: tags, dtype: object
In [7]:
df_tags=df.explode('tags')
In [8]:
df_tags
Out[8]:
address email postal_code properties sort_name town website place_id modified_ts created_ts name loc country_code tags descriptions phone_numbers status
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Bar & pub food [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Comedy [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Restaurants [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
0 5 York Place admin@thestand.co.uk EH1 3EB {'place.child-restrictions': True, 'place.faci... Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand {'latitude': '55.955806109395006', 'longitude'... GB Venues [{'type': 'description.list.default', 'descrip... {'info': '0131 558 7272', 'box_office': '0131 ... live
1 10 Orwell Terrace NaN EH11 2DY NaN St Bride's Centre Edinburgh http://stbrides.wordpress.com 371 2019-12-04T13:27:26Z 2019-12-04T13:27:26Z St Bride's Centre {'latitude': '55.94255035', 'longitude': '-3.2... GB Cinemas [{'type': 'description.list.default', 'descrip... {'info': '0131 346 1405'} live
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1220 NaN EH32 0QB NaN Aberlady Local Nature Reserve Longniddry NaN 112611 2018-10-12T15:32:19Z 2018-10-12T15:32:19Z Aberlady Local Nature Reserve {'latitude': '56.01454821598324', 'longitude':... GB Outdoors NaN NaN live
1221 4 Picardy Place NaN EH1 3JT NaN Tokyo Bar And Nightclub Edinburgh https://www.facebook.com/tokyonightclubedin/ 112723 2018-10-15T16:54:59Z 2018-10-15T16:54:59Z Tokyo Bar And Nightclub {'latitude': '55.9569983', 'longitude': '-3.18... GB Clubs NaN {'info': '07378 413630'} live
1222 Edinburgh Road NaN EH49 6AB NaN Old Pavilion at Linlithgow Cricket Ground Linlithgow NaN 113099 2018-10-26T17:04:32Z 2018-10-26T17:04:32Z The Old Pavilion at Linlithgow Cricket Ground {'latitude': '55.97670900', 'longitude': '-3.5... GB Outdoors NaN NaN live
1223 19-21 George Street NaN EH2 2PB NaN Principal George Street Edinburgh https://www.phcompany.com/principal/edinburgh-... 114042 2018-11-29T17:04:07Z 2018-11-29T17:04:07Z The Principal George Street {'latitude': '55.95414000', 'longitude': '-3.1... GB Accommodation NaN {'info': '0131 225 1251'} live
1223 19-21 George Street NaN EH2 2PB NaN Principal George Street Edinburgh https://www.phcompany.com/principal/edinburgh-... 114042 2018-11-29T17:04:07Z 2018-11-29T17:04:07Z The Principal George Street {'latitude': '55.95414000', 'longitude': '-3.1... GB Hotels NaN {'info': '0131 225 1251'} live

3057 rows × 17 columns

In [9]:
g_tags=df_tags.groupby(['tags']).size().reset_index()
g_tags=g_tags.rename(columns={0: "number_of_times"}).sort_values(by=['number_of_times'], ascending=False)
g_tags
Out[9]:
tags number_of_times
233 Public buildings 224
317 Venues 224
217 Outdoors 180
242 Restaurants 155
234 Pubs & bars 150
... ... ...
163 Housing Association 1
164 IT 1
165 Ice Cream 1
166 Ice cream 1
349 tea shop 1

350 rows × 2 columns
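
The tail of the table above lists "Ice Cream" and "Ice cream" as separate tags, so some counts are probably split across spelling variants. One possible way to merge them is to normalise case and whitespace before grouping. This is only a sketch on toy data; the `tag_norm` column name is an assumption, not part of the dataset.

```python
import pandas as pd

# Toy tag column; the labels mirror a few values from the output above.
toy_tags = pd.DataFrame(
    {"tags": ["Ice Cream", "Ice cream", "tea shop", "Venues", "Venues"]}
)

# Strip surrounding whitespace and fold case so that variants collapse.
toy_tags["tag_norm"] = toy_tags["tags"].str.strip().str.casefold()

tag_counts = (
    toy_tags.groupby("tag_norm")
    .size()
    .reset_index(name="number_of_times")
    .sort_values(by="number_of_times", ascending=False)
)
print(tag_counts)
```

Case folding loses the original capitalisation, so a display label would have to be chosen separately if the plot should keep it.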

In [10]:
px.histogram(g_tags, x="tags", y="number_of_times", histfunc="sum", color="tags", title='Frequency of tags places')

Experiment 3: Exploring Towns & Names

In [11]:
df["town"][1:10]
Out[11]:
1    Edinburgh
2    Edinburgh
3    Edinburgh
4    Edinburgh
5    Edinburgh
6    Edinburgh
7    Edinburgh
8    Edinburgh
9    Edinburgh
Name: town, dtype: object

3.1 Frequency of places grouped by towns

In [12]:
df_town=df.dropna(subset=['town'])
town=df_town.groupby(['town']).size().reset_index()
town=town.rename(columns={0: "number_of_times"})
# drop the first group (index 0), presumably a blank or otherwise invalid town value
town=town.drop([0])
In [13]:
town=town.sort_values(by=['number_of_times'], ascending=False)
town
Out[13]:
town number_of_times
45 Edinburgh 736
38 Dunfermline 38
121 St Andrews 31
37 Dunbar 17
71 Kirkcaldy 17
... ... ...
86 Lothianburn 1
33 Dairsie 1
34 Dalgety Bay 1
40 EH8 8BL 1
136 nr Dunbar 1

136 rows × 2 columns
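
The table above contains town values such as "EH8 8BL" (which looks like a postcode) and "nr Dunbar". A rough way to surface such entries for manual review is sketched below. The regular expression is only an approximation of the UK postcode format, and reading "nr" as "near" is an assumption.

```python
import pandas as pd

# Toy town values taken from the output above.
toy_towns = pd.Series(["Edinburgh", "EH8 8BL", "nr Dunbar", "Dunbar"], name="town")

# Rough pattern for postcode-like strings such as "EH8 8BL" (not a validator).
postcode_like = toy_towns.str.match(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$")

# Values starting with the abbreviation "nr " (assumed to mean "near").
near_prefix = toy_towns.str.lower().str.startswith("nr ")

suspicious = toy_towns[postcode_like | near_prefix]
print(suspicious.tolist())
```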

In [14]:
px.scatter(town, x='town', y='number_of_times', color='number_of_times',  size="number_of_times", size_max=60, title="Frequency of places grouped by towns")

3.2 Frequency of places grouped by name

In [15]:
df_name_town=df.groupby(['name']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town.reset_index()
Out[15]:
index name number_of_times
0 1167 Waterstones 7
1 1121 University of Edinburgh 2
2 308 Edinburgh Napier University 2
3 437 Holy Trinity Church 2
4 879 St Mary's Parish Church 2
... ... ... ...
1206 404 Halliwell’s House Museum 1
1207 403 Hallhill Healthy Living Centre 1
1208 402 Haddington School of Dance and Music 1
1209 401 Haddington Corn Exchange 1
1210 1210 theSpace on the Mile 1

1211 rows × 3 columns

3.3 Frequency of places grouped by name and town

In [16]:
df_name_town=df.groupby(['name', 'town']).size().reset_index()
df_name_town=df_name_town.rename(columns={0: "number_of_times"})
df_name_town=df_name_town.sort_values(by=['number_of_times'], ascending=False)
df_name_town
Out[16]:
name town number_of_times
1170 Waterstones Edinburgh 3
308 Edinburgh Napier University Edinburgh 2
1206 ZOO Charteris Edinburgh 2
1123 University of Edinburgh Edinburgh 2
13 52 Canoes Edinburgh 2
... ... ... ...
406 Harehead Farm Cranshaws 1
405 Hanover Tap Edinburgh 1
404 Halliwell’s House Museum Selkirk 1
403 Hallhill Healthy Living Centre Dunbar 1
1216 theSpace on the Mile Edinburgh 1

1217 rows × 3 columns

Experiment 4: Exploring Properties

In [17]:
df_properties=pd.concat([df.drop(['properties'], axis=1), df['properties'].apply(pd.Series)], axis=1)
In [18]:
df_properties[0:3]
Out[18]:
address email postal_code sort_name town website place_id modified_ts created_ts name ... place.child-restrictions place.facilities.dogs-allowed place.facilities.free-wifi place.facilities.guide-dogs place.facilities.hearing-loop place.facilities.parking place.facilities.toilets place.facilities.toilets_baby-changing place.facilities.toilets_disabled place.facilities.wheelchair-access
0 5 York Place admin@thestand.co.uk EH1 3EB Stand Edinburgh http://www.thestand.co.uk 1 2021-11-24T12:18:33Z 2021-11-24T12:18:33Z The Stand ... True False True NaN NaN True True NaN False False
1 10 Orwell Terrace NaN EH11 2DY St Bride's Centre Edinburgh http://stbrides.wordpress.com 371 2019-12-04T13:27:26Z 2019-12-04T13:27:26Z St Bride's Centre ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 West Parliament Square ifecosse.edimbourg-cslt@diplomatie.gouv.fr EH1 1RN Institut Français d'Ecosse Edinburgh http://www.ifecosse.org.uk 372 2021-02-23T16:57:44Z 2021-02-23T16:57:44Z Institut Français d'Ecosse ... NaN NaN False NaN NaN False False NaN False True

3 rows × 29 columns
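
`apply(pd.Series)` builds one Series per row, which tends to get slow on larger frames. An alternative that should give the same columns is to build the expanded frame from the list of dictionaries in one pass, treating missing entries as empty dictionaries. The sketch below uses a toy frame; `records`, `expanded`, and `toy_properties` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy frame: one place with properties, one without (NaN), as in the dataset.
toy = pd.DataFrame({
    "name": ["A", "B"],
    "properties": [
        {"place.facilities.parking": True, "place.capacity.max": "160"},
        np.nan,
    ],
})

# Replace missing entries with empty dicts, then build all columns at once.
records = [p if isinstance(p, dict) else {} for p in toy["properties"]]
expanded = pd.DataFrame(records, index=toy.index)

toy_properties = pd.concat([toy.drop(columns=["properties"]), expanded], axis=1)
print(toy_properties)
```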

4.1 Frequency of places grouped by wheelchair-access and town

In [19]:
df_properties_wc=df_properties.groupby(['place.facilities.wheelchair-access', 'town']).size().reset_index()
df_properties_wc=df_properties_wc.rename(columns={0: "number_of_times"})
df_properties_wc=df_properties_wc.sort_values(by=['number_of_times'], ascending=False)
df_properties_wc
Out[19]:
place.facilities.wheelchair-access town number_of_times
23 True Edinburgh 129
6 False Edinburgh 69
22 True Dunfermline 7
45 True St Andrews 5
38 True Musselburgh 3
30 True Kirkcaldy 2
32 True Livingston 2
12 False South Queensferry 2
34 True Loanhead 2
35 True Lochgelly 2
7 False Hawick 2
44 True Selkirk 2
4 False Dunfermline 2
25 True Falkland 2
31 True Linlithgow 1
33 True Livingston village 1
0 False Anstruther 1
39 True Newburgh 1
36 True Lothianburn 1
37 True Melrose 1
28 True Jedburgh 1
40 True North Berwick 1
41 True Peebles 1
42 True Peeblesshire 1
43 True Prestonpans 1
46 True St Monans 1
29 True Juniper Green 1
24 True Eyemouth 1
27 True Hawick 1
26 True Glenrothes 1
2 False Dalkeith 1
3 False Dunbar 1
5 False Duns 1
8 False Kirkcaldy 1
9 False Peebles 1
10 False Pittenweem 1
11 False Prestonpans 1
13 False St Andrews 1
14 False Wilkieston 1
15 True Aberlady 1
16 True Anstruther 1
17 True Auchtermuchty 1
18 True Bathgate 1
19 True Cockenzie 1
20 True Cupar 1
21 True Dirleton 1
1 False Bathgate 1
47 True Tranent 1

4.2 Frequency of places grouped by toilets_disabled and town

In [21]:
df_properties_td=df_properties.groupby(['place.facilities.toilets_disabled', 'town']).size().reset_index()
df_properties_td=df_properties_td.rename(columns={0: "number_of_times"})
df_properties_td=df_properties_td.sort_values(by=['number_of_times'], ascending=False)
df_properties_td
Out[21]:
place.facilities.toilets_disabled town number_of_times
23 True Edinburgh 117
7 False Edinburgh 73
22 True Dunfermline 6
30 True Kirkcaldy 3
44 True St Andrews 3
5 False Dunfermline 3
37 True Musselburgh 2
13 False Peebles 2
25 True Falkland 2
32 True Livingston 2
16 False St Andrews 2
15 False South Queensferry 2
27 True Hawick 2
42 True Prestonpans 2
38 True Newburgh 1
29 True Juniper Green 1
39 True North Berwick 1
40 True Peeblesshire 1
36 True Melrose 1
35 True Lothianburn 1
34 True Lochgelly 1
33 True Loanhead 1
41 True Pittenweem 1
43 True Selkirk 1
31 True Linlithgow 1
45 True St Monans 1
46 True Tranent 1
0 False Aberlady 1
24 True Eyemouth 1
28 True Jedburgh 1
11 False Lochgelly 1
2 False Bathgate 1
3 False Cupar 1
4 False Dunbar 1
6 False Duns 1
8 False Hawick 1
9 False Livingston village 1
10 False Loanhead 1
12 False Musselburgh 1
26 True Glenrothes 1
14 False Selkirk 1
17 True Anstruther 1
18 True Auchtermuchty 1
19 True Bathgate 1
20 True Cockenzie 1
21 True Dalkeith 1
1 False Anstruther 1
47 True Wilkieston 1
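
The long tables in 4.1 and 4.2 list True and False counts on separate rows, which makes towns hard to compare. A cross-tabulation puts both counts on one row per town and allows a share to be computed. This is a sketch on toy data; the "yes"/"no" labels and the `share_yes` column are assumptions made for readability.

```python
import pandas as pd

# Toy data: the last Dunbar place has no wheelchair-access information.
toy = pd.DataFrame({
    "town": ["Edinburgh", "Edinburgh", "Edinburgh", "Dunbar", "Dunbar"],
    "place.facilities.wheelchair-access": [True, True, False, False, None],
})

# Turn the flag into string labels; missing values stay missing.
labels = [
    "yes" if v is True else "no" if v is False else None
    for v in toy["place.facilities.wheelchair-access"]
]
flag = pd.Series(labels, index=toy.index, name="wheelchair_access")

# One row per town, one column per label; rows with a missing flag are left out.
access_by_town = pd.crosstab(toy["town"], flag)

# Share of places with a known flag that report wheelchair access.
access_by_town["share_yes"] = access_by_town["yes"] / (
    access_by_town["yes"] + access_by_town["no"]
)
print(access_by_town)
```

The same pattern would apply to `place.facilities.toilets_disabled` by swapping the column name.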

Experiment 5: Exploring Descriptions

In [22]:
df_descriptions=df.explode('descriptions')
df_descriptions=pd.concat([df_descriptions.drop(['descriptions'], axis=1), df_descriptions['descriptions'].apply(pd.Series)], axis=1)
df_descriptions=df_descriptions.dropna(subset=['description']).reset_index()
documents=df_descriptions["description"].values
In [23]:
len(documents)
Out[23]:
404
In [24]:
import re 
from gensim.parsing.preprocessing import remove_stopwords
def clean_documents(text):
    text = re.sub(r'\S*@\S*\s?', '', text, flags=re.MULTILINE) # remove email
    text = re.sub(r'http\S+', '', text, flags=re.MULTILINE) # remove web addresses
    text = re.sub("\'", "", text) # remove single quotes
    text = remove_stopwords(text)
    return text
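
To see what the regular expressions in `clean_documents` do, the sketch below applies the same three substitutions to a made-up sentence. The stopword step is left out so that the example needs only the standard library; `strip_contacts` and `sample` are illustrative names.

```python
import re

def strip_contacts(text):
    """Apply the three regex substitutions used in clean_documents."""
    text = re.sub(r'\S*@\S*\s?', '', text)  # drop tokens containing "@"
    text = re.sub(r'http\S+', '', text)     # drop "http" plus the non-space characters after it
    text = re.sub("'", "", text)            # drop single quotes
    return text

# Made-up input for illustration.
sample = "Email admin@example.com or visit http://example.com today, it's great"
cleaned = strip_contacts(sample)
print(cleaned)
```

Removing a URL leaves the surrounding spaces in place, so the result can contain double spaces; in the notebook, `remove_stopwords` runs on the text afterwards.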
In [25]:
d=[clean_documents(text) for text in documents]

Generating Text Embeddings

In [26]:
model = SentenceTransformer('all-MiniLM-L6-v2')
# Compute text embeddings for the cleaned descriptions with the pretrained all-MiniLM-L6-v2 model (no training happens here)
text_embeddings = model.encode(d, batch_size = 8, show_progress_bar = True)

In [27]:
np.shape(text_embeddings)
Out[27]:
(404, 384)

Description Similarity

In [28]:
similarities = cosine_similarity(text_embeddings)
# argsort is ascending, so each row ends with the most similar documents
similarities_sorted = similarities.argsort()
id_1 = []
id_2 = []
score = []
for index, array in enumerate(similarities_sorted):
    id_1.append(index)
    # array[-1] is normally the document itself (similarity 1), so array[-2] is its nearest neighbour
    id_2.append(array[-2])
    score.append(similarities[index][array[-2]])
index_df = pd.DataFrame({'id_1' : id_1,
                          'id_2' : id_2,
                          'score' : score})
print(index_df)
     id_1  id_2     score
0       0   155  0.438337
1       1    53  0.624832
2       2    12  0.488309
3       3   103  0.550572
4       4   103  0.648187
..    ...   ...       ...
399   399   227  0.603416
400   400   392  0.660965
401   401   286  0.487236
402   402   287  0.669923
403   403   371  0.274279

[404 rows x 3 columns]
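
Picking `array[-2]` relies on each document sorting last in its own row. With exact duplicate descriptions or floating-point ties that may not hold, and a document could end up paired with itself. A possible alternative is to mask the diagonal and take the row-wise maximum. The sketch below uses toy vectors and plain NumPy in place of the real embeddings; all names in it are illustrative.

```python
import numpy as np
import pandas as pd

# Toy embeddings: rows 0 and 1 are identical, row 2 points elsewhere.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Cosine similarity: normalise the rows, then take dot products.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = unit @ unit.T

# Mask self-similarity so a document can never be its own nearest neighbour.
masked = sim.copy()
np.fill_diagonal(masked, -np.inf)

nearest = masked.argmax(axis=1)
nearest_df = pd.DataFrame({
    "id_1": np.arange(len(emb)),
    "id_2": nearest,
    "score": masked.max(axis=1),
})
print(nearest_df)
```

This also avoids sorting every row, since only the single best match per document is needed.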
In [29]:
index_df["score"].sort_values(ascending=False)
Out[29]:
61     0.889471
62     0.889471
86     0.844464
85     0.844464
133    0.840954
         ...   
349    0.337666
318    0.325064
232    0.323925
403    0.274279
321    0.217728
Name: score, Length: 404, dtype: float32
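
In the ranking above every close pair appears twice (61 and 62, 85 and 86), once from each side. To list each pair only once, the upper triangle of the similarity matrix can be used. This is a sketch on a toy matrix; the `threshold` value is a hypothetical cut-off, not one taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy symmetric similarity matrix for three documents.
sim = np.array([
    [1.00, 0.89, 0.10],
    [0.89, 1.00, 0.20],
    [0.10, 0.20, 1.00],
])
threshold = 0.8  # hypothetical cut-off for "near duplicate"

# Indices above the diagonal: each unordered pair exactly once, no self-pairs.
rows, cols = np.triu_indices(len(sim), k=1)
pairs = pd.DataFrame({"id_1": rows, "id_2": cols, "score": sim[rows, cols]})

near_duplicates = pairs[pairs["score"] >= threshold].sort_values(
    "score", ascending=False
)
print(near_duplicates)
```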
In [30]:
index_df.iloc[85]
Out[30]:
id_1     85.000000
id_2     86.000000
score     0.844464
Name: 85, dtype: float64

NOTE: Documents 61 and 62 seem to be the most similar pair (score 0.889). Let's see what they contain.

In [32]:
documents[61]
Out[32]:
'Five miles from Edinburgh city centre, Dalkeith Country Estate is home to some beautiful woodland, with bluebell walks, riverside trails, cycle tracks and picnic areas for families to enjoy. There’s also an excellent adventure park, with giant slides, tree tip walkways, rope swings and its famous flying fox zip slide.'
In [33]:
documents[62]
Out[33]:
"Five miles from Edinburgh city centre, Dalkeith Country Estate is home to some beautiful woodland, with bluebell walks, riverside trails, cycle tracks and picnic areas for families to enjoy. There’s also the excellent Fort Douglas Adventure Playground, with giant slides, tree top walkways, rope swings and its famous flying fox zip slide.\n\nOpened in July 2016, the brand new Dalkeith Country Park is an experience unlike any other. You can find the magical new Fort Douglas Adventure Playground alongside Restoration Yard, that holds The Kitchen Restautrant, the store and wellbeing lab wellbeing lab in the former stableyard which has been lovingly restored to create a truly special day out.\n\nThere's also the wider park to explore with waymarked walking and cycling trails to suit the whole family and special events too. Explore the Old Oak Wood with trees over 900 years old, enjoy a picnic in one of the areas we've created, or simply breathe in the fresh air of this beautiful country park, which you'll find hard to believe is just a few miles from Edinburgh's city centre."

Experiment 6: Topic Modelling

In [39]:
topic_model = BERTopic(min_topic_size=10).fit(d, text_embeddings)
topics, probs = topic_model.transform(d, text_embeddings)
topic_model.visualize_topics()
In [40]:
topic_model.visualize_barchart()
In [41]:
topic_model.visualize_heatmap()
In [42]:
topic_model.get_topic_freq()
Out[42]:
Topic Count
0 0 211
1 -1 79
2 1 78
3 2 36
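
In BERTopic, topic -1 collects the documents that were not assigned to any topic (outliers). The share of such documents can be read off the frequency table. The sketch below copies the counts from the output above into a small frame rather than calling the fitted model.

```python
import pandas as pd

# Counts copied from the get_topic_freq() output above.
freq = pd.DataFrame({"Topic": [0, -1, 1, 2], "Count": [211, 79, 78, 36]})

total = freq["Count"].sum()
outliers = freq.loc[freq["Topic"] == -1, "Count"].sum()
outlier_share = outliers / total
print(f"{outliers} of {total} documents ({outlier_share:.1%}) are unassigned")
```

The total matches the 404 descriptions used for the embeddings, so roughly one in five descriptions was left without a topic.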